{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# The Ames housing dataset\n",
    "\n",
    "In this notebook, we will quickly present the \"Ames housing\" dataset. We will\n",
    "see that this dataset is similar to the \"California housing\" dataset. However,\n",
    "it is more complex to handle: it contains missing data and both numerical and\n",
    "categorical features.\n",
    "\n",
    "This dataset is located in the `datasets` directory. It is stored in a comma\n",
    "separated value (CSV) file. As previously mentioned, we are aware that the\n",
    "dataset contains missing values. The character `\"?\"` is used as a missing\n",
    "value marker.\n",
    "\n",
    "We will open the dataset and specify the missing value marker such that they\n",
    "will be parsed by pandas when opening the file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "ames_housing = pd.read_csv(\"../datasets/house_prices.csv\", na_values=\"?\")\n",
    "ames_housing = ames_housing.drop(columns=\"Id\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can have a first look at the available columns in this dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ames_housing.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that the last column named `\"SalePrice\"` is indeed the target that we\n",
    "would like to predict. So we will split our dataset into two variables\n",
    "containing the data and the target."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "target_name = \"SalePrice\"\n",
    "data, target = (\n",
    "    ames_housing.drop(columns=target_name),\n",
    "    ames_housing[target_name],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's have a quick look at the target before to focus on the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "target.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that the target contains continuous value. It corresponds to the price\n",
    "of a house in $. We can have a look at the target distribution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "target.plot.hist(bins=20, edgecolor=\"black\")\n",
    "plt.xlabel(\"House price in $\")\n",
    "_ = plt.title(\"Distribution of the house price \\nin Ames\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that the distribution has a long tail. It means that most of the house\n",
    "are normally distributed but a couple of houses have a higher than normal\n",
    "value. It could be critical to take this peculiarity into account when\n",
    "designing a predictive model.\n",
    "\n",
    "Now, we can have a look at the available data that we could use to predict\n",
    "house prices."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Looking at the dataframe general information, we can see that 79 features are\n",
    "available and that the dataset contains 1460 samples. However, some features\n",
    "contains missing values. Also, the type of data is heterogeneous: both\n",
    "numerical and categorical data are available.\n",
    "\n",
    "First, we will have a look at the data represented with numbers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "numerical_data = data.select_dtypes(\"number\")\n",
    "numerical_data.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that the data are mainly represented with integer number. Let's have a\n",
    "look at the histogram for all these features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "numerical_data.hist(\n",
    "    bins=20, figsize=(12, 22), edgecolor=\"black\", layout=(9, 4)\n",
    ")\n",
    "plt.subplots_adjust(hspace=0.8, wspace=0.8)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that some features have high picks for 0. It could be linked that this\n",
    "value was assigned when the criterion did not apply, for instance the area of\n",
    "the swimming pool when no swimming pools are available.\n",
    "\n",
    "We also have some feature encoding some date (for instance year).\n",
    "\n",
    "These information are useful and should also be considered when designing a\n",
    "predictive model.\n",
    "\n",
    "Now, let's have a look at the data encoded with strings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "string_data = data.select_dtypes(object)\n",
    "string_data.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These features are categorical. We can make some bar plot to see categories\n",
    "count for each feature."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from math import ceil\n",
    "from itertools import zip_longest\n",
    "\n",
    "n_string_features = string_data.shape[1]\n",
    "nrows, ncols = ceil(n_string_features / 4), 4\n",
    "\n",
    "fig, axs = plt.subplots(ncols=ncols, nrows=nrows, figsize=(14, 80))\n",
    "\n",
    "for feature_name, ax in zip_longest(string_data, axs.ravel()):\n",
    "    if feature_name is None:\n",
    "        # do not show the axis\n",
    "        ax.axis(\"off\")\n",
    "        continue\n",
    "\n",
    "    string_data[feature_name].value_counts().plot.barh(ax=ax)\n",
    "    ax.set_title(feature_name)\n",
    "\n",
    "plt.subplots_adjust(hspace=0.2, wspace=0.8)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Plotting this information allows us to answer to two questions:\n",
    "\n",
    "* Is there few or many categories for a given features?\n",
    "* Is there rare categories for some features?\n",
    "\n",
    "Knowing about these peculiarities would help at designing the predictive\n",
    "pipeline."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"admonition note alert alert-info\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
    "<p class=\"last\">In order to keep the content of the course simple and didactic, we\n",
    "created a version of this database without missing values.</p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ames_housing_no_missing = pd.read_csv(\n",
    "    \"../datasets/ames_housing_no_missing.csv\"\n",
    ")\n",
    "ames_housing_no_missing.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It contains the same information as the original dataset after using a\n",
    "[`sklearn.impute.SimpleImputer`](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)\n",
    "to replace missing values using the mean along each numerical column\n",
    "(including the target), and the most frequent value along each categorical\n",
    "column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.compose import make_column_transformer\n",
    "from sklearn.impute import SimpleImputer\n",
    "from sklearn.pipeline import make_pipeline\n",
    "\n",
    "\n",
    "numerical_features = [\n",
    "    \"LotFrontage\",\n",
    "    \"LotArea\",\n",
    "    \"MasVnrArea\",\n",
    "    \"BsmtFinSF1\",\n",
    "    \"BsmtFinSF2\",\n",
    "    \"BsmtUnfSF\",\n",
    "    \"TotalBsmtSF\",\n",
    "    \"1stFlrSF\",\n",
    "    \"2ndFlrSF\",\n",
    "    \"LowQualFinSF\",\n",
    "    \"GrLivArea\",\n",
    "    \"BedroomAbvGr\",\n",
    "    \"KitchenAbvGr\",\n",
    "    \"TotRmsAbvGrd\",\n",
    "    \"Fireplaces\",\n",
    "    \"GarageCars\",\n",
    "    \"GarageArea\",\n",
    "    \"WoodDeckSF\",\n",
    "    \"OpenPorchSF\",\n",
    "    \"EnclosedPorch\",\n",
    "    \"3SsnPorch\",\n",
    "    \"ScreenPorch\",\n",
    "    \"PoolArea\",\n",
    "    \"MiscVal\",\n",
    "    target_name,\n",
    "]\n",
    "categorical_features = data.columns.difference(numerical_features)\n",
    "\n",
    "most_frequent_imputer = SimpleImputer(strategy=\"most_frequent\")\n",
    "mean_imputer = SimpleImputer(strategy=\"mean\")\n",
    "\n",
    "preprocessor = make_column_transformer(\n",
    "    (most_frequent_imputer, categorical_features),\n",
    "    (mean_imputer, numerical_features),\n",
    ")\n",
    "ames_housing_preprocessed = pd.DataFrame(\n",
    "    preprocessor.fit_transform(ames_housing),\n",
    "    columns=categorical_features.tolist() + numerical_features,\n",
    ")\n",
    "ames_housing_preprocessed = ames_housing_preprocessed[ames_housing.columns]\n",
    "ames_housing_preprocessed = ames_housing_preprocessed.astype(\n",
    "    ames_housing.dtypes\n",
    ")\n",
    "(ames_housing_no_missing == ames_housing_preprocessed).all()"
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}